Collaborative Filtering with Python (Python 2.7)

The Last.FM dataset

The data set contains information about Last.FM users: their gender, age, and which artists they have listened to. Here we use only the data for Germany, transformed into a frequency matrix with one row per user and one column per artist.

We will use this matrix to build two types of collaborative filtering:

Item Based: compares items by the similarity of their consumption histories
User Based: scores items for each user by combining the user's consumption history with the item similarities



In [5]:

    
import pandas as pd
from scipy.spatial.distance import cosine

# The data was already downloaded.
data = pd.read_csv('data/lastfm/lastfm-matrix-germany.csv')

# Preview the first six rows and a slice of the artist columns
data.head(6).iloc[:, 2:10]









    Out[5]:

   abba  ac/dc  adam green  aerosmith  afi  air  alanis morissette  alexisonfire
0     0      0           0          0    0    0                  0             0
1     0      0           1          0    0    0                  0             0
2     0      0           0          0    0    0                  0             0
3     0      0           0          0    0    0                  0             0
4     0      0           0          0    0    0                  0             0
5     0      0           0          0    0    0                  0             1

Item Based Collaborative Filtering



In [6]:

    
# Item based collaborative filtering does not need the user column.
data_germany = data.drop('user', axis=1)



In [7]:

    
#Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)

Now we can start filling in the similarities. We will use cosine similarity. In Python, the SciPy library provides a cosine distance function, so no custom code is needed.

In essence, the cosine similarity of two columns is their sum product (dot product) divided by the product of the square roots of each column's sum of squares.
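The description above can be written out directly. The following is a minimal sketch in plain Python, not the SciPy implementation; the function name `cosine_similarity`, the toy columns, and the choice to return 0 for an all-zero column are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))       # sum product of the columns
    norm_a = math.sqrt(sum(x * x for x in a))    # root of the sum of squares
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption of this sketch: an all-zero column is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy artist columns (1 = the user listened to the artist)
artist_a = [1, 0, 1, 1]
artist_b = [1, 1, 0, 1]
similarity = cosine_similarity(artist_a, artist_b)
print(similarity)
```

Here the two artists share two of their three listeners, so the similarity is 2 / (sqrt(3) * sqrt(3)) = 2/3.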



In [8]:

    
# Fill in the empty cells with cosine similarities.
# Loop over every pair of artist columns.
for i in range(0, len(data_ibs.columns)):
    for j in range(0, len(data_ibs.columns)):
        # scipy's cosine() returns a distance, so similarity = 1 - distance
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])
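Because cosine similarity is symmetric, each pair only needs to be computed once. The sketch below illustrates this on toy data in plain Python; the `listens` dictionary, its values, and the helper name are hypothetical and are not part of the notebook's data.

```python
import math
from itertools import combinations_with_replacement


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption of this sketch: an all-zero column scores 0 against everything
    return dot / norms if norms else 0.0


# Hypothetical toy data: one 0/1 listening column per artist
listens = {
    "abba": [1, 0, 1, 0],
    "madonna": [1, 0, 1, 1],
    "tool": [0, 1, 0, 0],
}

# Visit each unordered pair once and mirror the result
similarity = {name: {} for name in listens}
for first, second in combinations_with_replacement(sorted(listens), 2):
    score = cosine_similarity(listens[first], listens[second])
    similarity[first][second] = score
    similarity[second][first] = score

print(similarity["abba"]["madonna"])
```

For n artists this computes n(n+1)/2 similarities instead of n squared.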




With our similarity matrix filled out, we can find each item's neighbours by looping through ‘data_ibs’, sorting each column in descending order, and grabbing the names of the top 10 artists.



In [10]:

    
# Create a placeholder for the closest neighbours of each item
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 11))
 
# Loop through the similarity dataframe and fill in the neighbouring item names
for i in range(0, len(data_ibs.columns)):
    data_neighbours.iloc[i, :10] = data_ibs.iloc[:, i].sort_values(ascending=False).index[:10]
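The neighbour lookup amounts to sorting one artist's similarity scores and keeping the first few names. A minimal sketch in plain Python, assuming a made-up score dictionary (`similarity_to_abba` and its values are hypothetical):

```python
# Hypothetical similarity scores between "abba" and every artist, itself included
similarity_to_abba = {"abba": 1.0, "madonna": 0.82, "tool": 0.0, "queen": 0.41}


def top_neighbours(scores, n):
    """Names of the n most similar items, most similar first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


neighbours = top_neighbours(similarity_to_abba, 3)
print(neighbours)
```

The item itself has similarity 1, so it ranks first in its own neighbour list.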



Show the results:



In [21]:

    
data_neighbours.iloc[:10, :5]









    Out[21]:

                                   1                      2                3                      4                     5
a perfect circle    a perfect circle                   tool            dredg               deftones        porcupine tree
abba                            abba                madonna  robbie williams          elvis presley       michael jackson
ac/dc                          ac/dc  red hot chili peppers        metallica            iron maiden         the offspring
adam green                adam green         the libertines      the strokes           babyshambles             radiohead
aerosmith                  aerosmith                     u2     led zeppelin              metallica                 ac/dc
afi                              afi   funeral for a friend     rise against           fall out boy             anti-flag
air                              air         massive attack        goldfrapp              morcheeba  thievery corporation
alanis morissette  alanis morissette              tori amos      alicia keys  red hot chili peppers        kelly clarkson
alexisonfire            alexisonfire                 atreyu        underoath   funeral for a friend           silverstein
alicia keys              alicia keys                beyonce      norah jones             maria mena       black eyed peas

User Based Collaborative Filtering

The process for creating a User Based recommendation system:

Have an Item Based similarity matrix at your disposal (DONE)
Check which items the user has consumed (listened to or purchased): if an item has been consumed, we do not recommend it.
Otherwise, for each item the user has not consumed:
- Find the top N neighbours of the item.
- Get the user's consumption record (number of listens) for each neighbour.
- Average the consumption records, using the similarity scores as weights.
Recommend the items with the highest scores (i.e., the highest weighted averages of consumption).

We first need a formula. We calculate the inner product of two vectors (the user's purchase history for the neighbours, and the neighbours' similarity scores to the current item), then divide that figure by the sum of the similarities.



In [60]:

    
# Helper function: similarity-weighted average of the consumption history
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)
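To make the weighted average concrete, here is a small worked example in plain Python. It is a sketch rather than the notebook's helper: `get_score` takes lists instead of pandas Series, the numbers are made up, and returning 0 when all similarities are zero is an assumption.

```python
def get_score(history, similarities):
    """Similarity-weighted average of a 0/1 consumption history."""
    total = sum(similarities)
    if total == 0:
        # Assumption of this sketch: no similar neighbours means a score of 0
        return 0.0
    weighted = sum(h * s for h, s in zip(history, similarities))
    return weighted / float(total)


# The user listened to the first and third neighbour, but not the second
history = [1, 0, 1]
similarities = [0.5, 0.25, 0.25]
score = get_score(history, similarities)
print(score)  # (0.5 + 0.25) / 1.0 = 0.75
```

The more similar the neighbours the user has already listened to, the closer the score gets to 1.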

The rest is a matter of applying this function to the data frames in the right way. We start by creating a variable to hold our similarity data. This is basically the same as our original data but with nothing filled in except the headers.



In [61]:

    
# Create a placeholder matrix for the scores, and fill in the user column
data_sims = pd.DataFrame(index=data.index, columns=data.columns)
data_sims['user'] = data['user']



In [63]:

    
data_sims.head(3).iloc[:, :10]









    Out[63]:

   user  a perfect circle  abba  ac/dc  adam green  aerosmith  afi  air  alanis morissette  alexisonfire
0     1               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN
1    33               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN
2    42               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN

We now loop through the rows and columns filling in empty spaces with similarity scores.

Note that we score items that the user has already consumed as 0, because there is no point in recommending them again.



In [64]:

    
# Loop through all rows, skip the user column, and fill in the scores
for i in range(0, len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]

        if data.iloc[i, j] == 1:
            # Already consumed: score 0 so it is not recommended again
            data_sims.iloc[i, j] = 0
        else:
            # Positions 1-9 skip position 0, which is the item itself
            product_top_names = data_neighbours.loc[product].iloc[1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False).iloc[1:10]
            user_purchases = data_germany.loc[user, product_top_names]
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)
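The scoring loop can be traced by hand on a tiny example. The sketch below follows the same steps for a single user in plain Python; the three artists, their similarity values, the neighbour lists, and the function name `score_user` are all hypothetical.

```python
def score_user(user_listens, neighbours, similarity):
    """Score every artist for one user; artists already heard get 0."""
    scores = {}
    for artist, heard in user_listens.items():
        if heard == 1:
            scores[artist] = 0.0  # already consumed, do not recommend again
            continue
        names = neighbours[artist]
        weights = [similarity[artist][name] for name in names]
        history = [user_listens[name] for name in names]
        total = sum(weights)
        if total == 0:
            scores[artist] = 0.0  # assumption: no similar neighbours, score 0
        else:
            scores[artist] = sum(h * w for h, w in zip(history, weights)) / total
    return scores


# Hypothetical item-item similarities (self-similarity left out)
similarity = {
    "abba": {"madonna": 0.8, "tool": 0.1},
    "madonna": {"abba": 0.8, "tool": 0.2},
    "tool": {"abba": 0.1, "madonna": 0.2},
}
# Neighbours of each artist, most similar first, excluding the artist itself
neighbours = {
    "abba": ["madonna", "tool"],
    "madonna": ["abba", "tool"],
    "tool": ["madonna", "abba"],
}
user_listens = {"abba": 1, "madonna": 0, "tool": 0}

scores = score_user(user_listens, neighbours, similarity)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A user who has only listened to abba gets madonna ranked first, since madonna is abba's closest neighbour in this toy data.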

We can now produce a matrix of User Based recommendations as follows:



In [68]:

    
# Placeholder for the top six recommendations per user
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend['user'] = data_sims['user']

Instead of a matrix filled with scores, however, it would be nice to see the artist names. This can be done with the following loop:



In [69]:

    
# Instead of the top scores, we want to see the artist names
for i in range(0, len(data_sims.index)):
    # Skip the user column, then take the six highest-scoring artists
    data_recommend.iloc[i, 1:] = data_sims.iloc[i, 1:].sort_values(ascending=False).index[:6]



In [70]:

    
# Print a sample
print(data_recommend.iloc[:5, :5])









    



  user                      1              2                3              4
0    1         flogging molly       coldplay        aerosmith    the beatles
1   33  red hot chili peppers  kings of leon        peter fox      gentleman
2   42                 oomph!    lacuna coil        rammstein     schandmaul
3   51            the subways      the kooks  franz ferdinand      the hives
4   62           jack johnson        incubus       mando diao  the fratellis


